NSF PAR Search | NSF Public Access Repository


Accurate short-read alignment through r-index-based pangenome indexing

https://doi.org/10.1101/gr.279858.124

Varki, Rahul; Rossi, Massimiliano; Ferro, Eddie; Oliva, Marco; Garrison, Erik; Langmead, Ben; Boucher, Christina (June 2025, Genome Research)

Aligning to a linear reference genome can result in a higher percentage of reads going unmapped or being incorrectly mapped owing to variations not captured by the reference, otherwise known as reference bias. Recently, in efforts to mitigate reference bias, there has been a movement to switch to using pangenomes, a collection of genomes, as the reference. In this paper, we introduce Moni-align, the first short-read pangenome aligner built on the r-index, a variation of the classical FM-index that can index collections of genomes in O(r)-space, where r is the number of runs in the Burrows–Wheeler transform. Moni-align uses a seed-and-extend strategy for aligning reads, utilizing maximal exact matches as seeds, which can be efficiently obtained with the r-index. Using both simulated and real short-read data sets, we demonstrate that Moni-align achieves alignment accuracy comparable to vg map and vg giraffe, the leading pangenome aligners. Although currently best suited for aligning to localized pangenomes owing to computational constraints, Moni-align offers a robust foundation for future optimizations that could further broaden its applicability.
Full Text Available
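The Moni-align abstract above measures index size in r, the number of runs in the Burrows–Wheeler transform. The following is a minimal sketch of that quantity only, assuming a naive rotation-sort BWT suitable for toy strings; the function names `bwt` and `count_runs` are illustrative and are not taken from Moni-align or any r-index library.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations.

    Naive and quadratic in memory: for toy inputs only. Assumes '$' does
    not occur in `text`; '$' sorts before letters in ASCII, so it acts as
    the smallest terminator symbol.
    """
    if "$" in text:
        raise ValueError("input must not contain the terminator '$'")
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal adjacent characters (the 'r' in r-index)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

On collections of highly similar genomes the BWT tends to contain long runs, so r typically grows much more slowly than the total text length, which is what makes O(r)-space indexing attractive.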
PangenomicsBench: A Benchmark Suite and Characterization of Pangenomics

https://doi.org/10.1109/IISWC66894.2025.00031

Kaplan, Noah; Schmelzle, Jan-Niklas; Gu, Yufeng; Garrison, Erik; Batten, Christopher; Das, Reetuparna (October 2025, IEEE)

Full Text Available
wgatools: an ultrafast toolkit for manipulating whole-genome alignments

https://doi.org/10.1093/bioinformatics/btaf132

Wei, Wenjie; Gui, Songtao; Yang, Jian; Garrison, Erik; Yan, Jianbing; Liu, Hai-Jun (March 2025, Bioinformatics)
Alkan, Can (Ed.)
Abstract. Summary: With the rapid development of long-read sequencing technologies, the era of individual complete genomes is approaching. We have developed wgatools, a cross-platform, ultrafast toolkit that supports a range of whole-genome alignment formats, offering practical tools for conversion, processing, evaluation, and visualization of alignments, thereby facilitating population-level genome analysis and advancing functional and evolutionary genomics. Availability and implementation: wgatools supports diverse formats and can process, filter, and statistically evaluate alignments, perform alignment-based variant calling, and visualize alignments both locally and genome-wide. Built with Rust for efficiency and safe memory usage, it ensures fast performance and can handle large datasets consisting of hundreds of genomes. wgatools is published as free software under the MIT open-source license, and its source code is freely available at https://github.com/wjwei-handsome/wgatools and https://zenodo.org/records/14882797.
Full Text Available
Rapid GPU-Based Pangenome Graph Layout

https://doi.org/10.1109/SC41406.2024.00035

Li, Jiajie; Schmelzle, Jan-Niklas; Du, Yixiao; Heumos, Simon; Guarracino, Andrea; Guidi, Giulia; Prins, Pjotr; Garrison, Erik; Zhang, Zhiru (November 2024, IEEE)

Full Text Available
Cluster-efficient pangenome graph construction with nf-core/pangenome

https://doi.org/10.1093/bioinformatics/btae609

Heumos, Simon; Heuer, Michael L; Hanssen, Friederike; Heumos, Lukas; Guarracino, Andrea; Heringer, Peter; Ehmele, Philipp; Prins, Pjotr; Garrison, Erik; Nahnsen, Sven (November 2024, Bioinformatics)
Alkan, Can (Ed.)
Abstract. Motivation: Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. However, current construction methods often introduce biases, excluding complex sequences or relying on references. The PanGenome Graph Builder (PGGB) addresses these issues. To date, though, there is no state-of-the-art pipeline allowing for easy deployment, efficient and dynamic use of available resources, and scalable usage at the same time. Results: To overcome these limitations, we present nf-core/pangenome, a reference-unbiased approach implemented in Nextflow following nf-core’s best practices. Leveraging biocontainers ensures portability and seamless deployment in High-Performance Computing (HPC) environments. Unlike PGGB, nf-core/pangenome distributes alignments across cluster nodes, enabling scalability. Demonstrating its efficiency, we constructed pangenome graphs for 1000 human chromosome 19 haplotypes and 2146 Escherichia coli sequences, achieving a two to threefold speedup compared to PGGB without increasing greenhouse gas emissions. Availability and implementation: nf-core/pangenome is released under the MIT open-source license, available on GitHub and Zenodo, with documentation accessible at https://nf-co.re/pangenome/docs/usage.
Full Text Available
Pangenome-Informed Language Models for Synthetic Genome Sequence Generation

https://doi.org/10.1101/2024.09.18.612131

Huang, Pengzhi; Charton, François; Schmelzle, Jan-Niklas M; Darnell, Shelby S; Prins, Pjotr; Garrison, Erik; Suh, G Edward (September 2024, bioRxiv)

Abstract. Language Models (LM) have been extensively utilized for learning DNA sequence patterns and generating synthetic sequences. In this paper, we present a novel approach for the generation of synthetic DNA data using pangenomes in combination with LM. We introduce three innovative pangenome-based tokenization schemes, including two that can decouple from private data, while enhancing long DNA sequence generation. Our experimental results demonstrate the superiority of pangenome-based tokenization over classical methods in generating high-utility synthetic DNA sequences, highlighting a promising direction for the public sharing of genomic datasets.
Full Text Available
Recurrent evolution and selection shape structural diversity at the amylase locus

https://doi.org/10.1038/s41586-024-07911-1

Bolognini, Davide; Halgren, Alma; Lou, Runyang Nicolas; Raveane, Alessandro; Rocha, Joana L; Guarracino, Andrea; Soranzo, Nicole; Chin, Chen-Shan; Garrison, Erik; Sudmant, Peter H (October 2024, Nature)

Abstract. The adoption of agriculture triggered a rapid shift towards starch-rich diets in human populations¹. Amylase genes facilitate starch digestion, and increased amylase copy number has been observed in some modern human populations with high-starch intake², although evidence of recent selection is lacking³,⁴. Here, using 94 long-read haplotype-resolved assemblies and short-read data from approximately 5,600 contemporary and ancient humans, we resolve the diversity and evolutionary history of structural variation at the amylase locus. We find that amylase genes have higher copy numbers in agricultural populations than in fishing, hunting and pastoral populations. We identify 28 distinct amylase structural architectures and demonstrate that nearly identical structures have arisen recurrently on different haplotype backgrounds throughout recent human history. AMY1 and AMY2A genes each underwent multiple duplication/deletion events with mutation rates up to more than 10,000-fold the single-nucleotide polymorphism mutation rate, whereas AMY2B gene duplications share a single origin. Using a pangenome-based approach, we infer structural haplotypes across thousands of humans identifying extensively duplicated haplotypes at higher frequency in modern agricultural populations. Leveraging 533 ancient human genomes, we find that duplication-containing haplotypes (with more gene copies than the ancestral haplotype) have rapidly increased in frequency over the past 12,000 years in West Eurasians, suggestive of positive selection. Together, our study highlights the potential effects of the agricultural revolution on human genomes and the importance of structural variation in human adaptation.
Full Text Available
Pangenome graph layout by Path-Guided Stochastic Gradient Descent

https://doi.org/10.1093/bioinformatics/btae363

Heumos, Simon; Guarracino, Andrea; Schmelzle, Jan-Niklas M; Li, Jiajie; Zhang, Zhiru; Hagmann, Jörg; Nahnsen, Sven; Prins, Pjotr; Garrison, Erik (July 2024, Bioinformatics)
Robinson, Peter (Ed.)
Abstract. Motivation: The increasing availability of complete genomes demands models to study genomic variability within entire populations. Pangenome graphs capture the full genomic similarity and diversity between multiple genomes. In order to understand them, we need to see them. For visualization, we need a human-readable graph layout: a graph embedding in low (e.g. two) dimensional depictions. Due to a pangenome graph’s potential excessive size, this is a significant challenge. Results: In response, we introduce a novel graph layout algorithm: the Path-Guided Stochastic Gradient Descent (PG-SGD). PG-SGD uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes. This avoids the quadratic cost seen in previous versions of graph drawing by SGD. We show that our implementation efficiently computes the low-dimensional layouts of gigabase-scale pangenome graphs, unveiling their biological features. Availability and implementation: We integrated PG-SGD in ODGI which is released as free software under the MIT open source license. Source code is available at https://github.com/pangenome/odgi.
Full Text Available
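The PG-SGD abstract above describes sampling pairs of nodes along a path and using their genomic distance as the target layout distance. The sketch below illustrates that idea on a toy graph, assuming a capped pairwise update with weight 1/d² and an exponentially decaying learning rate, as is common in SGD-based graph drawing. It is not the ODGI implementation; every name and parameter here (`init_layout`, `pg_sgd_layout`, `eta_max`, `eta_min`) is hypothetical.

```python
import math
import random


def path_offsets(path, node_len):
    """Nucleotide offset at which each step of `path` starts."""
    offsets, pos = [], 0
    for node in path:
        offsets.append(pos)
        pos += node_len[node]
    return offsets


def stress(pos, paths, node_len):
    """Weighted squared error between layout and path distances, over all step pairs."""
    total = 0.0
    for path in paths:
        off = path_offsets(path, node_len)
        for i in range(len(path)):
            for j in range(i + 1, len(path)):
                d = off[j] - off[i]  # positive as long as node lengths are positive
                total += (math.dist(pos[path[i]], pos[path[j]]) - d) ** 2 / d ** 2
    return total


def init_layout(paths, seed=0):
    """Random 2D start position for every node that appears on some path."""
    rng = random.Random(seed)
    nodes = sorted({n for p in paths for n in p})
    return {n: [rng.random(), rng.random()] for n in nodes}


def pg_sgd_layout(pos, paths, node_len, iters=30, eta_max=100.0, eta_min=0.01, seed=0):
    """Return a refined copy of `pos`; requires iters >= 2 and positive node lengths."""
    rng = random.Random(seed)
    pos = {n: list(p) for n, p in pos.items()}
    offsets = [path_offsets(p, node_len) for p in paths]
    updates = 10 * sum(len(p) for p in paths)
    for t in range(iters):
        # Exponential decay of the learning rate from eta_max down to eta_min.
        eta = eta_max * (eta_min / eta_max) ** (t / (iters - 1))
        for _ in range(updates):
            k = rng.randrange(len(paths))
            path = paths[k]
            if len(path) < 2:
                continue
            # The path acts as the coordinate system: sample two of its steps.
            i, j = rng.sample(range(len(path)), 2)
            d = abs(offsets[k][j] - offsets[k][i])
            a, b = pos[path[i]], pos[path[j]]
            dx, dy = a[0] - b[0], a[1] - b[1]
            dist = math.hypot(dx, dy) or 1e-9  # guard against coincident points
            mu = min(eta / d ** 2, 1.0)        # capped step size
            shift = mu * (dist - d) / 2 / dist
            a[0] -= shift * dx
            a[1] -= shift * dy
            b[0] += shift * dx
            b[1] += shift * dy
    return pos
```

Because each update touches only one sampled pair, the cost per iteration depends on the number of samples rather than on all node pairs, which is the property the abstract contrasts with the quadratic cost of earlier approaches.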
Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation

https://doi.org/10.1093/bioinformatics/btad512

Kille, Bryce; Garrison, Erik; Treangen, Todd J; Phillippy, Adam M (September 2023, Bioinformatics)
Robinson, Peter (Ed.)
Abstract. Motivation: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. Results: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. Availability and implementation: MashMap3 is available at https://github.com/marbl/MashMap.
Full Text Available
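The minmer abstract above builds on two ideas: Jaccard similarity over k-mer sets, and sampling k-mers per window. The sketch below illustrates both on toy strings. It assumes lexicographic order as a stand-in for the hash ordering a real tool would use, and it reads "multiple sampled k-mers per window" loosely as keeping the `s` smallest k-mers in each window, so `s=1` gives ordinary minimizers. The names `kmers`, `jaccard`, and `sample_kmers` are illustrative and are not MashMap's API.

```python
def kmers(seq: str, k: int) -> set:
    """Set of all k-mers occurring in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int) -> float:
    """Exact Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 1.0


def sample_kmers(seq: str, k: int, w: int, s: int = 1) -> set:
    """Keep the `s` smallest k-mers from every window of `w` consecutive k-mers.

    Returns a set of (position, k-mer) pairs. Ordering is lexicographic
    (an assumption standing in for a hash), with ties broken by leftmost
    position. s=1 corresponds to minimizer winnowing.
    """
    occurrences = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(occurrences) - w + 1):
        window = sorted(occurrences[start:start + w])
        picked.update((pos, kmer) for kmer, pos in window[:s])
    return picked


if __name__ == "__main__":
    print(jaccard("ACGTACGT", "ACGTTCGT", 3))
    print(sorted(sample_kmers("ACGTACGT", k=3, w=2)))
```

The exact Jaccard value is what sketch-based tools try to estimate from the sampled k-mers alone; this sketch does not reproduce the estimator or its unbiasedness analysis.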
Unbiased pangenome graphs

https://doi.org/10.1093/bioinformatics/btac743

Garrison, Erik; Guarracino, Andrea (November 2022, Bioinformatics)
Alkan, Can (Ed.)
Abstract. Motivation: Pangenome variation graphs model the mutual alignment of collections of DNA sequences. A set of pairwise alignments implies a variation graph, but there are no scalable methods to generate such a graph from these alignments. Existing related approaches depend on a single reference, a specific ordering of genomes or a de Bruijn model based on a fixed k-mer length. A scalable, self-contained method to build pangenome graphs without such limitations would be a key step in pangenome construction and manipulation pipelines. Results: We design the seqwish algorithm, which builds a variation graph from a set of sequences and alignments between them. We first transform the alignment set into an implicit interval tree. To build up the variation graph, we query this tree-based representation of the alignments to reduce transitive matches into single DNA segments in a sequence graph. By recording the mapping from input sequence to output graph, we can trace the original paths through this graph, yielding a pangenome variation graph. We present an implementation that operates in external memory, using disk-backed data structures and lock-free parallel methods to drive the core graph induction step. We demonstrate that our method scales to very large graph induction problems by applying it to build pangenome graphs for several species. Availability and implementation: seqwish is published as free software under the MIT open source license. Source code and documentation are available at https://github.com/ekg/seqwish. seqwish can be installed via Bioconda https://bioconda.github.io/recipes/seqwish/README.html or GNU Guix https://github.com/ekg/guix-genomics/blob/master/seqwish.scm.
Full Text Available
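The seqwish abstract above describes reducing transitive matches into single graph segments while recording how each input sequence maps onto the graph. The sketch below illustrates only that transitive-closure idea, assuming an in-memory union-find over individual bases in place of the implicit interval tree and disk-backed structures the abstract mentions. Matches are assumed to be exact, forward-strand, and given as hypothetical `(name_a, start_a, name_b, start_b, length)` tuples; `DisjointSet` and `induce_graph` are illustrative names, not seqwish's interface.

```python
class DisjointSet:
    """Union-find with path compression over arbitrary hashable items."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited item straight at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def induce_graph(seqs, matches):
    """Merge matched bases transitively; return (bases, paths).

    `bases[i]` is the character of graph base i, and `paths[name]` lists the
    graph base ids visited by each input sequence. Runs of bases are not
    compacted into longer nodes here.
    """
    ds = DisjointSet()
    for a, sa, b, sb, length in matches:
        for off in range(length):
            if seqs[a][sa + off] != seqs[b][sb + off]:
                raise ValueError("matches must be exact")
            ds.union((a, sa + off), (b, sb + off))
    ids, bases, paths = {}, [], {}
    for name, seq in seqs.items():
        path = []
        for i, base in enumerate(seq):
            root = ds.find((name, i))
            if root not in ids:
                ids[root] = len(bases)
                bases.append(base)
            path.append(ids[root])
        paths[name] = path
    return bases, paths
```

Spelling out the bases along each returned path reproduces the corresponding input sequence, which is the sense in which the original sequences remain traceable through the induced graph.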
